Drift Management

DRAFT

This is currently a DRAFT document and is subject to frequent change.

Overview

Proper deployment and configuration are crucial for production, and significant effort is spent in pre-production to ensure that managed resources are correctly configured and deployed with specific, verified versions of software. The failure to detect and understand unplanned changes in the configuration or content of a managed resource increases the risk of operational failure and hinders troubleshooting efforts. Unplanned changes in configuration or in content are referred to as drift. Production, staging, and/or recovery configurations are designed to be identical in certain aspects in order to be resilient in the event of failures in production. When these configurations drift from one another, they leave what is commonly called a configuration gap between them. Configuration drift is a natural occurrence in enterprise and data center environments due to frequent hardware and software changes. Disaster recovery failures and HA system failures are frequently a result of configuration drift.

Scenarios

Let's consider some scenarios to illustrate how configuration drift can adversely affect an enterprise. These are intended to be motivating examples that illustrate where and when configuration drift management might be applicable.

Scheduled Database Password Change

As part of regularly scheduled maintenance, the passwords for production databases will be updated. This will in turn require the data source configurations on production JBoss EAP servers to be updated with new passwords. The IT organization is not currently using RHQ and is instead using ad-hoc scripts to automate and apply the changes. Any number of things could happen in this scenario that could result in serious problems.

Programmatic errors in the script could result in an incorrect data source configuration, keeping production EAP servers offline for an extended period of time because they cannot communicate with the database. Or, if the script needs to be manually run on each server, an administrator could forget to apply the changes to one of the servers. These two problems would likely be remedied quickly as the source of the problem could be quickly and easily identified.

Now let's assume that the IT organization is using RHQ to manage its infrastructure. Administrators could create a group for the production EAP servers and then apply the data source configuration change as a group configuration update. Suppose the update fails on one of the EAP servers. Maybe the host machine is offline. RHQ will report the results of the operation, making it clear that the configuration update failed on one of the EAP servers. Now if the administrator is waiting intently for RHQ to report the results of the operation, the error is easily rectified. But suppose the administrator gets distracted by some other production issues and forgets about the configuration update. Discovering the problem will be delayed, requiring additional time and maintenance, which adds to overall operational costs.

Database URL Change

Let's consider another example involving data source configuration changes. A new production database server has been installed and EAP servers need to be updated such that JDBC URLs refer to the new server. To make things more interesting, let's say that the old database server is kept running. Maybe it will be used for backups. Again let's start with an IT organization that is using an ad-hoc scripting solution to apply the updates. If administrators fail to update one of the EAP servers or if programmatic errors in the scripts result in one of the EAP servers not getting updated, this could result in serious problems that may not manifest themselves nearly as easily as in the database password example.

Now let's say that the IT organization is using RHQ to manage its EAP servers. Let's further suppose that there are advanced RHQ users/administrators who are utilizing the CLI to further automate the task. A CLI script is written to apply the group configuration update. The script iterates over each EAP server in the group and fetches and updates the resource for each data source as necessary using ConfigurationManager.updateResourceConfiguration. The script writer fails to understand a subtle yet important point: when that method returns, the operation may not have completed. The script in effect assumes that since no exception has been thrown, the updates completed successfully. Let's assume that one of the EAP servers fails to get updated for some reason. Although RHQ will report the failure, administrators may not recognize it soon enough, still resulting in serious problems that are costly both in terms of time and money.
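
To make the pitfall concrete, here is a minimal sketch of what "waiting for the update to actually finish" could look like. It is written in plain Java rather than as a CLI script, and every name in it (UpdateWaiter, Status, awaitCompletion, the status-check callback) is hypothetical and not part of the RHQ API. The only point it illustrates is that the script should poll until the update leaves the in-progress state and then inspect the final status, rather than treating "no exception" as success.

```java
import java.time.Duration;
import java.util.function.Supplier;

/** Hypothetical sketch; none of these names come from the RHQ API. */
final class UpdateWaiter {

    /** Assumed status values for a submitted configuration update. */
    enum Status { IN_PROGRESS, SUCCESS, FAILURE }

    /**
     * Polls statusCheck until the update is no longer IN_PROGRESS or the
     * timeout elapses. Returns the last status seen, so a timed-out (or
     * interrupted) wait is reported as IN_PROGRESS.
     */
    static Status awaitCompletion(Supplier<Status> statusCheck, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        Status status = statusCheck.get();
        while (status == Status.IN_PROGRESS && System.nanoTime() < deadline) {
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                // Preserve the interrupt for the caller and stop waiting.
                Thread.currentThread().interrupt();
                return status;
            }
            status = statusCheck.get();
        }
        return status;
    }
}
```

A script built this way would treat anything other than SUCCESS, including a timeout, as a reason to stop and flag the server instead of silently moving on to the next one.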

Increase heap size for EAP Servers

We run out of heap space for an EAP server in a production environment, which results in some downtime. The fix is easy enough - increase the maximum heap size. Suppose that the heap size is increased as a one-off change where an administrator logs onto the server, increases the heap size, and restarts the EAP server. Then at some point down the road, the machine on which the EAP server is running is taken offline for maintenance. Another EAP server is brought online. The problem though is that the new EAP server has the lower heap settings, and it too will eventually start running out of memory.

Deploy an Updated Version of an Application

Let's consider an example in which we are deploying an updated version of an internally hosted application in the form of an EAR or WAR. The application is running on an EAP cluster. This example is different from the previous ones in that it deals with a content change as opposed to a configuration change. The application update is to be done as a rolling outage where one server will be taken out of rotation at a time, have the updated application deployed to it, and then go back into rotation. There are a number of different possible failure points. We will look at a few of them. First, we could wind up deploying the wrong version of the application. Depending on how rigorously we verify that we have deployed the correct bits, the problem could go undetected for some time. A second problem that could arise is failing to update one of the cluster nodes. Speaking from personal experience, I know that this can wind up being a difficult problem to debug. Lastly, suppose that the updated version of the application requires an update to some other library that is deployed separately and that library does not get updated.

Terminology

User Level and GUI Terminology

Drift

Unplanned changes in monitored files such as configuration files, or content.

Drift Monitoring

In RHQ, the general notion of periodically scanning the file system for changes to files that likely indicate drift.

Drift Detection Definition

A definition guiding drift monitoring for a resource. It specifies, using a base directory and includes/excludes filters, which directories should be monitored. Additionally, it sets monitoring properties such as scan interval, enablement, and drift handling options. Also called a Drift Definition. A resource can have zero or more drift definitions.

Drift Detection Run

A single execution of drift detection for a resource. In other words a single drift definition applied to a single resource, at a specific time.

Drift Instance

In RHQ, a specific occurrence of drift, meaning a file change detected during a drift detection run. In the GUI, it is simply known as Drift.

Note: In general a drift instance reflects an unexpected change. But RHQ does provide a 'planned changes' mode in the drift definition. Although drift detection is always performed the same way, RHQ will handle the drift instance differently in planned changes mode, specifically, by disabling alerting for the drift instance and omitting it from the drift history view.

Snapshot

The file-set (really, file-version-set, as it's not just file names, it's the actual bits) resulting from a drift detection run. In other words, a 'snapshot' of the actual files on disk at a particular time.

Initial Snapshot

The snapshot resulting from the first drift detection run for a drift detection definition. The initial snapshot is marked as version 0. Variations from the initial snapshot will generate drift.

Snapshot View, Snapshot Delta View

A snapshot always represents a full file-set, as it exists for the resource at the time of the drift detection run. The GUI provides two views of a snapshot. The 'Snapshot View' shows the complete file-set. The 'Snapshot Delta View' shows only the file differences between two snapshots (typically the snapshot and its predecessor). For every snapshot other than the initial snapshot, the snapshot delta view is the default. The user can toggle between views as desired.

Snapshot Diff

A diff between two snapshots, which can be from the same resource or from different resources. The diff identifies files present in one snapshot but not in the other, and vice-versa. It also identifies the files that exist in both but whose content differs.
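
As a rough illustration of the three categories a snapshot diff reports, the sketch below models each snapshot as a map from file path to content hash. That map representation, and the names SnapshotDiff, onlyInLeft, onlyInRight, and contentDiffers, are assumptions made for the example and are not RHQ types.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: a snapshot is modeled as a map of path -> content hash. */
final class SnapshotDiff {
    final Set<String> onlyInLeft = new TreeSet<>();      // present only in the first snapshot
    final Set<String> onlyInRight = new TreeSet<>();     // present only in the second snapshot
    final Set<String> contentDiffers = new TreeSet<>();  // present in both, hashes differ

    static SnapshotDiff diff(Map<String, String> left, Map<String, String> right) {
        SnapshotDiff result = new SnapshotDiff();
        for (Map.Entry<String, String> entry : left.entrySet()) {
            String otherHash = right.get(entry.getKey());
            if (otherHash == null) {
                result.onlyInLeft.add(entry.getKey());
            } else if (!otherHash.equals(entry.getValue())) {
                result.contentDiffers.add(entry.getKey());
            }
        }
        for (String path : right.keySet()) {
            if (!left.containsKey(path)) {
                result.onlyInRight.add(path);
            }
        }
        return result;
    }
}
```

Because only paths and hashes are compared, the same routine works whether the two snapshots come from one resource at different times or from two different resources.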

Pinned Snapshot

By default a drift detection run looks for changes between the previous snapshot and the current file system state. This rolling snapshot approach ensures each change to a file will result in only one drift instance. For more strict environments we offer the ability to always detect against a specific, or 'pinned' snapshot. The user can pin a snapshot via the GUI (or CLI). In this situation a drift detection run always compares the current file system to the pinned snapshot. This can result in the same drift being reported on each detection run, until the situation is remedied.
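
The difference between the rolling and pinned approaches can be sketched in a few lines. This is only an illustration under the assumption that a snapshot is a map from path to content hash; DriftDetector and its methods are hypothetical names, not RHQ classes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch contrasting rolling and pinned drift detection. */
final class DriftDetector {
    private Map<String, String> baseline;  // path -> content hash
    private final boolean pinned;

    DriftDetector(Map<String, String> initialSnapshot, boolean pinned) {
        this.baseline = new HashMap<>(initialSnapshot);
        this.pinned = pinned;
    }

    /** Returns the paths that were added, changed, or removed relative to the baseline. */
    Set<String> run(Map<String, String> currentState) {
        Set<String> allPaths = new TreeSet<>(baseline.keySet());
        allPaths.addAll(currentState.keySet());
        Set<String> drift = new TreeSet<>();
        for (String path : allPaths) {
            if (!Objects.equals(baseline.get(path), currentState.get(path))) {
                drift.add(path);
            }
        }
        if (!pinned) {
            // Rolling mode: the current state becomes the baseline for the next run.
            baseline = new HashMap<>(currentState);
        }
        return drift;
    }
}
```

With pinned set to false, a changed file is reported once and then absorbed into the baseline. With pinned set to true, the same file keeps showing up on every run until it again matches the pinned snapshot.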

Drift Definition Template

A drift definition template is basically a preset drift definition, at the resource type level. It can be used to quickly create and manage resource level drift definitions. Each type supporting drift detection will define at least one template in its plugin descriptor. Additionally, users can create user-defined templates. Drift definitions are always derived from a template, and are by default attached to the template. If attached, changes to the template will be pushed down to the definition.

Pinned Template

A drift template can be pinned by pinning a snapshot to it. The snapshot will then be pinned to all attached definitions for the template. In that way many resources can perform drift detection against a single, trusted snapshot.

Compliance

A Resource Type is in compliance (with respect to drift monitoring) unless one of its imported Resources is not in compliance.

A Resource in inventory is in compliance (with respect to drift monitoring) unless it:

Has one or more Drift Definitions and for at least one of those definitions:
1. The file system backing the resource is missing the definition's base directory
2. Is pinned and the file system backing the resource does not match the pinned snapshot (meaning there is active drift)

Once the file system has the base directory and, if pinned, matches exactly the pinned snapshot, the resource is said to be in compliance.
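
The resource-level rule above reduces to a small predicate. The sketch below is one possible reading of it; the class and field names (Compliance, DefinitionState, and so on) are invented for illustration and do not correspond to RHQ domain classes.

```java
import java.util.List;

/** Hypothetical sketch of the resource compliance rule. */
final class Compliance {

    /** Assumed per-definition state as observed on the resource's file system. */
    static final class DefinitionState {
        final boolean baseDirectoryPresent;
        final boolean pinned;
        final boolean matchesPinnedSnapshot;  // only meaningful when pinned is true

        DefinitionState(boolean baseDirectoryPresent, boolean pinned, boolean matchesPinnedSnapshot) {
            this.baseDirectoryPresent = baseDirectoryPresent;
            this.pinned = pinned;
            this.matchesPinnedSnapshot = matchesPinnedSnapshot;
        }
    }

    /** A resource with no drift definitions is trivially in compliance. */
    static boolean isResourceInCompliance(List<DefinitionState> definitions) {
        for (DefinitionState definition : definitions) {
            if (!definition.baseDirectoryPresent) {
                return false;  // condition 1: base directory is missing
            }
            if (definition.pinned && !definition.matchesPinnedSnapshot) {
                return false;  // condition 2: active drift against the pinned snapshot
            }
        }
        return true;
    }
}
```

Note that an unpinned definition can only violate compliance through a missing base directory, since rolling snapshots absorb each change after it is reported.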

Remediation

The act of resolving drift. This is analogous to resolving a merge conflict in a version control system like Git. Resolving drift can be done in a number of ways including:

Revert back to a previous state
Acknowledge and accept the change
Change to something other than a previous state

Developer Level Terms

************************************************************************************
The following terms are used to describe implementation details and are
not terms an end user need know or understand.
************************************************************************************

Change Set

The report resulting from a drift detection run performed on the agent. The collection of changes where each change in the collection corresponds to a particular file. This is somewhat analogous to a commit in a version control system. Consider an initial commit for a file. With respect to the VCS, the only change represented by that commit is the addition of the file to version control. Each change set has an incrementing version associated with it, similar to say revision numbers in SVN.

Initial Change Set

Also known as the 'coverage changeset', the report resulting from the initial drift detection run for a definition. It reports the file-set to be used for subsequent drift determination, in essence the initial snapshot.

Drift Change Set

Reports on the set of changes between a resource and a previously known state of that resource. A change set entry corresponds directly to one file. It includes information to indicate whether the file has been added, modified or deleted from the set of files that are under drift monitoring. While a drift change set might have ten entries for example, indicating that ten files have changed, the change set does not provide any information as to the number or type of changes made for any particular file.

Drift File

Unique content corresponding to one or more files that are under drift monitoring. RHQ only stores one copy of a file's content (i.e., the actual bits). The content is uniquely identified by a SHA-256 hash. If multiple files, regardless of name, path, host machine, etc., have the same content as determined by the SHA-256 hash, then RHQ will store only one copy of that content, shared by all of those files.
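
The idea of storing content once per SHA-256 hash can be sketched as a small content-addressed store. The in-memory map and the names DriftFileStore, store, and storedCopies are assumptions for the example; how RHQ actually persists drift files is not shown here.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory sketch of content-addressed storage keyed by SHA-256. */
final class DriftFileStore {
    private final Map<String, byte[]> contentByHash = new HashMap<>();

    /** Returns the lowercase hex SHA-256 digest of the given bytes. */
    static String sha256Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every compliant Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Stores the content only if no identical content is already stored; returns its hash. */
    String store(byte[] content) {
        String hash = sha256Hex(content);
        contentByHash.putIfAbsent(hash, content.clone());
        return hash;
    }

    /** Number of distinct contents currently held. */
    int storedCopies() {
        return contentByHash.size();
    }
}
```

Two files with different names or on different hosts that hash to the same value end up sharing a single stored entry, which is the space saving the definition describes.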

Drift Configuration

The actual RHQ Configuration-based domain object that backs a Drift Detection Definition.

Use Cases

This section provides some discussion around the attached use case diagrams. There are several important concepts that are shared across these use cases. First, the agent only performs drift monitoring for a resource after the server has sent a request to do so. Drift monitoring is not enabled by default.

All monitoring is performed against the live resource, and the agent sends any changes that it detects up to the server for further analysis. You will see in the diagrams that the agent only sends changesets in an effort to minimize network IO.

All of the snapshots and changesets are stored on the server. The agent only stores the meta data of the snapshot used for comparison against the live resource. The agent does not store copies of the snapshot data.

U1: Start drift monitoring for an EAP server.

In this use case, we start drift monitoring for an EAP server where no snapshots previously exist. The agent takes a full snapshot of the EAP server and sends it up to the server. On the server, the user creates a drift alert definition such that an alert will be fired whenever drift is detected for the EAP server. A few changes are made to the EAP server (as separate, discrete events). The agent detects and reports these changes. On the server, changes are compared to determine whether or not there is drift. When drift is detected, the server checks to see if there are any drift alert definitions. Since a drift alert definition has been created, an alert is fired for each change that results in drift. Note that when a change occurs that does not result in drift, no alerts are fired.

U2: Start drift monitoring for an EAP server with a pre-existing baseline snapshot.

This is a variation of U1 where the main difference is in how the flow starts. Unlike in U1, here the server includes snapshot meta data in the drift monitoring request it sends to the agent. This means that the agent does not have to send an initial, full snapshot to the server. This distinction between U1 and U2 is important. We substantially reduce the amount of data that we send across the network, and we also reduce the work the agent has to do since no initial snapshot has to be generated. These savings can be large enough that they merit a separate case in U2.

U3: Baseline snapshot exists and user wants to deploy an application.

In U3, we have an EAP server that already has drift monitoring enabled. The user wants to make a change by way of deploying an application to the EAP server. The interesting thing about this case is that we want to avoid false drift detection to the greatest extent possible. Deploying the application is a planned, expected change. Planned changes, either in terms of configuration or content, involve doing some additional work on the agent. The agent takes a full snapshot immediately prior to deploying the application and then takes another immediately after the application has been deployed. The delta between these snapshots represents the expected, planned change, namely deploying an application. Then the baseline is updated/recalculated on the server.

Create Resource and Inventory Snapshots

An Inventory Snapshot provides information about a platform or group. An inventory snapshot is created to record the presence of resources in a group or platform and to confirm that the pool of known resources does not change.

A Resource Snapshot reports on a resource's configuration and content. Users will have the ability to manage Resource Snapshots. A resource snapshot can be created, deleted, and tagged. A snapshot can also be provisioned in order to repair, to configure, or to deploy applications. [Provide examples of provisioning snapshots to repair, configure, deploy]

Audit Trails

RHQ already provides a lot of functionality for creating and managing different types of snapshots via audit trails. RHQ maintains an audit trail for virtually every management function that it provides. The following subsections provide a brief overview of the resource configuration and content audit trails. While these audit trails will not serve as the basis for resource snapshots, it is important to be aware of them and understand what functionality they provide.

Resource Configuration

RHQ maintains a history of resource configuration changes. When a configuration is modified through RHQ, the change is logged in the history. There is also a job that runs on the agent to detect configuration changes that occur outside of RHQ. These out-of-band changes are also recorded in the history. Each entry in the history stores a full copy or snapshot of the resource configuration. The history can be viewed and queried. Rollback is supported as well. Resource configuration history is represented by the ResourceConfigurationUpdate entity which maps to the RHQ_CONFIG_UPDATE table.

Content

The content audit trail is updated when content is added, updated, or deleted. To be precise there is a distinction to be made between adding and deploying content. Adding content might refer to uploading a package to the server while deploying refers to pushing those bits out to a managed resource. For the purposes of this discussion we are primarily interested in deploying content. The audit trail does not store copies of the actual bits of the content. It does however store the file name, size, and a checksum. The audit trail or history can be viewed and queried, but it does not support rollback like the resource configuration audit trail does. The content history is represented by the entity class InstalledPackageHistory which maps to the table RHQ_INSTALLED_PKG_HIST.

Inventory Snapshots

It was stated earlier that an inventory snapshot records the presence of resources in a group or platform. We can use a recursive dynagroup as a starting point for the snapshot. Group membership is periodically recalculated such that new members are automatically added. Since a snapshot implies immutability, there are a couple of things we need to consider. First, a group merely maintains references to its member resources. For the snapshot though, we need a copy (i.e., snapshot) of those member resources, or at least a copy of those parts of the resources for which we want to detect and report on drift. Secondly, when a change in membership is detected, we want to create a new group with the changed membership rather than the change being applied to the existing group.

Managing Snapshots

We need to provide the ability to create, delete, and tag inventory snapshots. There needs to be a UI for viewing snapshots. From that UI, users should be able to tag and delete snapshots. There needs to be the ability to create a snapshot for groups. Since the concept of an inventory snapshot is tied to group membership, we might want to add a create snapshot button on the Inventory tab in the group UI where there are buttons for adding and removing members from the group.

Security should be role-based. We may need to introduce new roles for creating/deleting snapshots.

Resource Snapshots

A resource snapshot includes configuration and content. Snapshots can be managed such that they can be created, deleted, tagged, and used in provisioning. Storing copies of the bits may have serious performance implications in terms of space, speed, and network utilization. Disk space could quickly become an issue as snapshots are stored for a very large inventory or even for a moderately sized inventory consisting of really large resources (large as in the amount of space that the resource and its content consumes). Execution speed might also be an issue both for the server and for the agent when streaming lots of large files. Lastly, streaming large amounts of data across the network could limit bandwidth and adversely affect other (non-RHQ) resources and users on the network.

So far we have discussed configuration and content snapshots somewhat independently of one another. We need to provide a unified view of the two and probably a unified data structure as well. As a motivating example, suppose we are testing an EAP deployment in a QA environment. Once we are satisfied from a QA standpoint we want to promote the EAP deployment to a production environment with the exception of security credentials and data source URLs. This would involve creating a snapshot of both the configuration and content of the EAP instance and then provisioning a production server with that snapshot. Bundles seem like a logical option here for a couple of reasons. First, a bundle can consist of both configuration and content. Secondly, bundles are the vehicle for provisioning in RHQ today.

Managing Snapshots

We need to provide the ability to create, delete, and tag resource snapshots. There needs to be a UI for viewing snapshots. From the UI users should be able to tag and delete snapshots. Security should be role-based. We may need to introduce new roles for creating/deleting snapshots.

What Should Be Included in a Snapshot?

This section is focused on resource snapshots. A resource snapshot consists of two parts - the configuration and/or content data being copied and the meta data. The distinction between configuration and content is not really relevant. What is important is whether a file is text or binary. Snapshots will include a full copy of the bits of text-based files. For binary files such as JARs, WARs, and EARs, only the meta data will be included. That is, only the hash and path of binary files will be included in the snapshot. The reason for this is that RHQ does not have a content storage system that is suitable for handling increasingly large volumes of data.

Snapshots will include full copies of text files contained in an exploded archive, but only meta data will be included for binary files in an exploded archive. Keep in mind that an exploded archive is in fact just a directory. Consider an exploded WAR. A snapshot should include full copies of HTML, CSS, JSP, etc. files. But library JAR files in WEB-INF/lib will only have their meta data included in the snapshot.

Let's consider a couple examples. Suppose we want to create a snapshot of a managed EAP server. The snapshot should include copies of run.conf, jboss-log4j.xml, jboss-server.xml, and data source files to name a few. For binary files like library JAR files and deployed applications in the form of WARs or EARs, the snapshot should only include the meta data.

Now suppose we want to generate a snapshot that includes only data source files and web applications. This suggests a need for a filtering capability. The filter might specify that files matching the patterns *-ds.xml and *.war are to be included.
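
One way such an includes filter could be expressed is with the glob support in java.nio.file. The sketch below is only an assumption about how filtering might work; SnapshotFilter is a hypothetical name, and matching against the file name alone (rather than the full path) is a simplification chosen for the example.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of an includes filter based on glob patterns. */
final class SnapshotFilter {
    private final List<PathMatcher> includeMatchers = new ArrayList<>();

    SnapshotFilter(String... globPatterns) {
        for (String pattern : globPatterns) {
            includeMatchers.add(FileSystems.getDefault().getPathMatcher("glob:" + pattern));
        }
    }

    /** Returns true if the file's name matches at least one include pattern. */
    boolean includes(Path file) {
        Path fileName = file.getFileName();
        if (fileName == null) {
            return false;  // e.g. a file system root has no file name
        }
        for (PathMatcher matcher : includeMatchers) {
            if (matcher.matches(fileName)) {
                return true;
            }
        }
        return false;
    }
}
```

A filter constructed as new SnapshotFilter("*-ds.xml", "*.war") would accept deploy/oracle-ds.xml and deploy/app.war while skipping files such as conf/run.conf.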

Where Should Snapshots Be Stored?

If you go back and look at the attached use case diagrams, you will see that snapshots are stored on the server. While the merits of storing snapshots on the server versus storing them on the agent are debatable, there are a couple of use cases where storing snapshots on the agent does not make much sense. We want to support the ability to start drift monitoring for a resource from a pre-existing snapshot. This scenario is captured in use case U2. To support U2 we need to maintain an inventory of snapshots, and this is best done on the server side. Another scenario we want to support is comparing snapshots from different resources. Again, having some sort of centralized inventory of snapshots is key here. In addition to supporting these (and other) scenarios, storing snapshots on the server will help keep the footprint on the agent platform to a minimum.

We do need to store something on the agent. The agent will keep a copy of the snapshot meta data that is being used for comparison. The meta data will have to be stored on disk so that it can survive agent restarts.

What data structures will make up a snapshot?

It has already been mentioned that a snapshot consists of data and meta data. The meta data will in effect be a hash table in which file system paths are mapped to hashes. This will be very similar to the FileHashcodeMap class used with bundles. The data will be stored in a compressed format. The content subsystem can be used for storing the snapshot bits.
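
To give a feel for the meta data side, the sketch below builds a path-to-hash table by walking a base directory. It does not use or imitate FileHashcodeMap; the class name SnapshotMetadata, the choice of SHA-256, and the use of forward-slash relative paths as keys are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

/** Hypothetical sketch: snapshot meta data as a map of relative path -> SHA-256 hex. */
final class SnapshotMetadata {

    /** Walks baseDir and hashes every regular file found beneath it. */
    static Map<String, String> scan(Path baseDir) {
        Map<String, String> hashesByPath = new TreeMap<>();
        try (Stream<Path> files = Files.walk(baseDir)) {
            files.filter(Files::isRegularFile).forEach(file -> {
                // Use '/' separators so keys are comparable across platforms.
                String relative = baseDir.relativize(file).toString().replace('\\', '/');
                hashesByPath.put(relative, sha256Hex(file));
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hashesByPath;
    }

    private static String sha256Hex(Path file) {
        try {
            // Reads the whole file into memory; a real scanner would stream large files.
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(Files.readAllBytes(file));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A table like this is small relative to the files it describes, which is why it is a reasonable candidate for what the agent keeps on disk between restarts.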

How to handle plugin updates to the drift templates

Templates with new names can always be added.

Plugin defined templates that are not pinned and do not have attached definitions:

Can be updated with impunity.
Can be removed.

Plugin defined templates that are pinned or do have attached definitions:

Can have base information updated only
- If the update contains a directory change it will fail with an error in the server log.

Plugin defined templates that have attached definitions:

Can not be removed
- The template will be moved to user-defined to continue supporting the existing definitions. A warning will be logged in the server log.

Outstanding Issues/Questions

This section will provide a list of design and/or implementation issues that need to be discussed and worked out.

How do we ignore insignificant changes?

There may be changes that we want to ignore, like whitespace or comments, when the agent is looking at files. The agent is going to compare a file's hash against the corresponding hash in the meta data table. A whitespace change would result in a different hash. Here are a couple of ways to address this. First, we go ahead and let the agent report and send the change up to the server. The server could perform a diff that ignores changes in whitespace. Alternatively, we could do the diff on the agent. This would require storing the "last known version" of the file on the agent.
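
Whichever side performs the comparison, a whitespace-insensitive check needs both versions of the text. A deliberately crude sketch follows; it collapses every run of whitespace (including line breaks) to a single space before comparing, which is an assumption about what "insignificant" means and would not handle comments. The class and method names are hypothetical.

```java
/** Hypothetical sketch of a whitespace-insensitive text comparison. */
final class WhitespaceInsensitiveCompare {

    /** Collapses each run of whitespace to one space and trims both ends. */
    static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    /** Returns true if the two texts differ only in the amount or kind of whitespace. */
    static boolean sameIgnoringWhitespace(String previous, String current) {
        return normalize(previous).equals(normalize(current));
    }
}
```

Note that this only ignores changes to existing whitespace. Introducing whitespace where there was none, such as a=1 becoming a = 1, still counts as a change under this rule.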

Is the Content subsystem suitable for storing snapshots?

There are several things to consider here. First, we need the ability to associate a snapshot to a particular resource as well as allow for snapshots to exist independently of resources. I believe the content subsystem provides this flexibility.

The meta data would be stored separately from the snapshot bits. Each entry in the meta data table (i.e., each file system path) would be stored as a row in the database. I have done some initial investigation looking at the file-hashcodes.dat file generated during a bundle deployment. For some EAP 5 bundle deployments, I had file-hashcodes.dat files that ranged between 2000 and 4000 rows. Using those numbers, the snapshot meta data table could easily get very large very fast. This calls into question whether or not we should explore other storage mechanisms. Exploring other options does not necessarily imply a one-size-fits-all solution. Many, maybe most, RHQ deployments are smaller in size (in terms of the number of managed resources) and would be fine with the existing content storage solution. But there are larger deployments where the existing back end will not scale well. To accommodate different types of deployments, we should consider a pluggable back end for snapshots.

How if at all do we handle the race condition that exists during a planned change?

Use case U3 covers dealing with a planned change. One of the things we have to address in this scenario is avoiding false drift detection. The solution laid out in the diagrams is to take a full snapshot immediately before and immediately after the planned change. Then the delta between those two snapshots represents the planned, expected change. It is not that easy though, because there is a sufficiently large window of opportunity for unplanned changes to occur in this time frame. Drift could easily go undetected. Do we need to consider alternative approaches for dealing with planned changes, or is this window for unplanned changes an acceptable risk?

For what types of planned changes do we need to prepare?

Use case U3 describes a scenario involving a planned change by way of deploying content. We deal with the planned change by taking before/after snapshots. We need to follow the same procedure for planned configuration updates. Do we need to follow the same process for changes made via the bundle system? What about executing resource operations?

For which conditions do we want to be able to generate alerts?

Use case U1 shows a drift alert definition that causes an alert to be fired whenever drift is detected. What other conditions might users want to alert on? Here are some other conditions to consider:

Any change that occurs. That is the user wants to be notified any time a new snapshot is generated, regardless of whether or not there is drift.
A change that results in further drift. Drift has already been detected, and more changes are made that result in additional drift.
A change that eliminates or reduces drift. This could be a recovery alert, and it would not apply to planned changes since we are taking steps to avoid reporting planned changes.

How do we determine which files are text and which are binary?

Looking at file extensions is not a very robust solution. We simply cannot know all possible file extensions. Moreover, not all file names include an extension. There is the UNIX/Linux file command, which provides a more robust way of determining a file type. Calling an external program like this would be a less than ideal solution as it would involve creating a subprocess for every file scanned. Some tools/libraries that we might consider include,
